Spark cluster

Setup

Get dataset

The OpenAQ dataset, containing air-pollution measurements from a span of a few days, will be used.

Polish cities with low PM10

Find the Polish cities whose PM10 pollution stayed below the maximum PM10 level recorded in Berlin over the chosen date range. Perform the calculation on data from a few days and sort the results in descending order of PM10 concentration.
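The query logic can be sketched in plain Python over a toy in-memory sample; the field names (city, country, parameter, value) follow the OpenAQ schema, the values are invented, and the real assignment would express the same steps as PySpark DataFrame operations.

```python
from collections import defaultdict

# Toy OpenAQ-style records; values are made up for illustration.
measurements = [
    {"city": "Berlin",   "country": "DE", "parameter": "pm10", "value": 48.0},
    {"city": "Berlin",   "country": "DE", "parameter": "pm10", "value": 61.5},
    {"city": "Warszawa", "country": "PL", "parameter": "pm10", "value": 55.0},
    {"city": "Gdansk",   "country": "PL", "parameter": "pm10", "value": 30.2},
    {"city": "Krakow",   "country": "PL", "parameter": "pm10", "value": 80.1},
]

pm10 = [m for m in measurements if m["parameter"] == "pm10"]

# Maximum PM10 observed in Berlin over the date range.
berlin_max = max(m["value"] for m in pm10 if m["city"] == "Berlin")

# Per-city maximum PM10 for Polish cities.
city_max = defaultdict(float)
for m in pm10:
    if m["country"] == "PL":
        city_max[m["city"]] = max(city_max[m["city"]], m["value"])

# Keep cities below Berlin's maximum, sorted descending by concentration.
result = sorted(
    ((c, v) for c, v in city_max.items() if v < berlin_max),
    key=lambda cv: cv[1],
    reverse=True,
)
print(result)  # [('Warszawa', 55.0), ('Gdansk', 30.2)]
```

Here "lower than Berlin" is interpreted as each city's maximum reading staying below Berlin's maximum; if the task means average concentration instead, only the aggregation changes.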

PM2.5 hourly max

For each hour (in UTC), find the highest PM2.5 reading among the selected cities. Sort the results in ascending order by date.
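A minimal sketch of this aggregation, using invented timestamps and values; in PySpark the same grouping would be done by truncating the timestamp to the hour and taking a max per group.

```python
from collections import defaultdict

# (UTC timestamp, city, PM2.5) tuples; all values are illustrative.
readings = [
    ("2021-03-01T10:17:00Z", "Warszawa", 21.0),
    ("2021-03-01T10:45:00Z", "Berlin",   34.5),
    ("2021-03-01T11:05:00Z", "Warszawa", 18.2),
    ("2021-03-02T10:30:00Z", "Berlin",   12.9),
]

hourly_max = defaultdict(float)
for ts, _city, pm25 in readings:
    hour = ts[:13]  # "YYYY-MM-DDTHH" -> one bucket per UTC hour
    hourly_max[hour] = max(hourly_max[hour], pm25)

# Ascending by date/hour, since the hour prefix sorts lexicographically.
result = sorted(hourly_max.items())
print(result)
# [('2021-03-01T10', 34.5), ('2021-03-01T11', 18.2), ('2021-03-02T10', 12.9)]
```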

Cities ranking by PM2.5

For each city, compute the average of its $N$ highest PM2.5 measurements and rank the cities by this value. Use only cities whose names consist solely of lowercase and uppercase letters of the Latin alphabet.
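The two pieces of this task, the name filter and the top-$N$ average, can be sketched as follows; $N$ and the sample values are placeholders, and in PySpark the filter would typically be an `rlike` on the city column.

```python
import re

N = 2  # illustrative choice of N

# Per-city PM2.5 readings; values are invented.
city_pm25 = {
    "Krakow": [40.0, 55.0, 61.0],
    "Łódź":   [70.0, 80.0, 90.0],  # rejected: contains non-Latin (diacritic) letters
    "Berlin": [30.0, 20.0],
}

# Only lowercase and uppercase Latin letters, per the task statement.
latin_only = re.compile(r"^[A-Za-z]+$")

ranking = sorted(
    (
        # Average of the N highest readings (fewer if a city has under N).
        (city, sum(sorted(vals, reverse=True)[:N]) / min(N, len(vals)))
        for city, vals in city_pm25.items()
        if latin_only.match(city)
    ),
    key=lambda cv: cv[1],
    reverse=True,
)
print(ranking)  # [('Krakow', 58.0), ('Berlin', 25.0)]
```

Note that a strict letters-only pattern also drops real city names containing spaces or hyphens; relax the regex if those should be kept.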

Pollution change over time

Create a visualization showing how average pollution changed over a few days in selected countries.
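One way to render this, assuming matplotlib as the plotting library (the task does not prescribe one): a line per country over the daily averages. The countries, dates, and values below are placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Hypothetical daily average pollution per country.
days = ["03-01", "03-02", "03-03", "03-04"]
daily_avg = {
    "PL": [41.2, 38.7, 45.0, 39.9],
    "DE": [25.1, 27.4, 24.8, 26.0],
}

fig, ax = plt.subplots()
for country, values in daily_avg.items():
    ax.plot(days, values, marker="o", label=country)
ax.set_xlabel("date")
ax.set_ylabel("average pollution [µg/m³]")
ax.set_title("Average pollution over a few days")
ax.legend()
fig.savefig("pollution_trend.png")
```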

Big data cluster computations

  1. Perform the calculations from task 1 (PM10 concentration) on data from the whole month.
  2. Measure the execution time for 2, 3, 4, 5, 6, and 7 worker instances.
  3. Create execution time, speedup, and efficiency plots.
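The derived metrics for step 3 can be computed from the measured times as below. The runtimes are placeholders, and the 2-worker run is taken as the baseline, an assumption made because the task starts at 2 worker instances rather than 1.

```python
# Hypothetical wall-clock times in seconds, keyed by worker count.
times = {2: 120.0, 3: 85.0, 4: 66.0, 5: 55.0, 6: 48.0, 7: 44.0}

base_n = min(times)       # baseline cluster size (2 workers, by assumption)
base_t = times[base_n]    # baseline runtime

# Speedup relative to the baseline run: S(n) = T(base) / T(n).
speedup = {n: base_t / t for n, t in times.items()}

# Efficiency normalizes speedup by the growth in workers: E(n) = S(n) * base_n / n.
efficiency = {n: speedup[n] * base_n / n for n in times}

for n in sorted(times):
    print(f"{n} workers: time={times[n]:6.1f}s  "
          f"speedup={speedup[n]:.2f}  efficiency={efficiency[n]:.2f}")
```

These dictionaries feed directly into three line plots (time, speedup, efficiency versus worker count).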

Spark UI reports

Run a computation on the full 2020 dataset and show example Spark UI reports for it, e.g. the DAG visualization, the event timeline (Gantt-style diagram), and data-size statistics.
